Multitier Annotation of Urdu Speech Corpus
Abstract
This paper describes the multi-level annotation of an Urdu speech corpus and its quality assessment using PRAAT. The corpus has been annotated at the phoneme, word, syllable and break index levels. Phoneme, word and break index annotation was done manually by trained linguists, whereas syllable-tier annotation was done automatically using a template matching algorithm. The mean accuracy of label and boundary identification at the phoneme and break index levels is 79.07% and 89.67%, respectively. The quality assessment of the word and syllable tiers is still under investigation.
Similar resources
Semi-Semantic Part of Speech Annotation and Evaluation
This paper presents the semi-semantic part of speech annotation and its evaluation via Krippendorff’s α for the URDU.KON-TB treebank developed for the South Asian language Urdu. The part of speech annotation, with its additional subcategories of morphology and semantics, provides a treebank with sufficient encoded information. The corpus used is collected from the Urdu Wikipedia and newspapers. ...
Building a Hierarchical Annotated Corpus of Urdu: The URDU.KON-TB Treebank
This work aims at the development of a representative treebank for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and the development of a reliable treebank for Urdu will have a significant impact on the state of the art for Urdu language processing. In the URDU.KON-TB treebank described here, a POS tagset, a syntactic tagset and a functional tagset have been proposed...
Building Computational Resources: The URDU.KON-TB Treebank and the Urdu Parser
This work presents the development of the URDU.KON-TB treebank, its annotation evaluation and guidelines, and the construction of the Urdu parser for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and the development of a reliable treebank and a parser will have a significant impact on the state of the art for automatic Urdu language processing. The work includes the ...
Developing a tagset for automated part-of-speech tagging in Urdu
While part-of-speech tagging is an established technology for Western European languages such as English or Spanish, extending the technique to Urdu presents a range of interesting issues. There are some problems associated with the writing system, e.g. the problems of locating token boundaries in the Urdu version of the Arabic script. However, there are also linguistic issues. Litt...
Publication date: 2014